Categorize internal failures with a static failure category#4972

Merged
dejanzele merged 4 commits into armadaproject:master from dejanzele:categorize-internal-errors on Jun 25, 2026

Conversation

@dejanzele dejanzele commented Jun 19, 2026

Member

Armada-generated job-run failures previously carried no failure_category/failure_subcategory. Only operator-classified pod failures (via the executor categorizer) were tagged, so any dashboard attributing "why did this run end" had a large unlabeled remainder for everything Armada itself decided.

This change stamps the failures Armada itself authors with a static, code-owned category internal plus a subcategory naming the cause. The boundary is the source of the error. internal covers errors Armada produces from its own logic, where it writes a fixed, self-authored message. Errors where Armada merely relays dynamic external content (the kubelet, the K8s API, the scheduler) are left to the operator categorizer, which attributes them into the operator's own categories (infra, user_error, and so on) using its configured default when no rule matches. An operator never has to match Armada's internal error strings, and the top-level value reads as a triage signal: internal means the failure was Armada's own machinery, not the workload.

Stamped internal:

  • Scheduler: lease expiry, max-runs-exceeded, job rejection.
  • Executor: failed job creation, reconciliation pod-missing, and the Armada-detected structural pod issues it authors a fixed message for (stuck-terminating, externally-deleted, active-deadline, issue-handler-error).

For the structural pod issues the categorization is deterministic: the categorizer is not consulted, so a configured rule cannot override internal. The only fallback anywhere is the categorizer's own defaultCategory, which applies to the external causes below.
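
The deterministic path for structural pod issues can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `PodIssueType` enum, the `categorize` wrapper, and the exact signature of `internalSubcategoryForPodIssueType` are assumptions made here. Only the helper's name and the category and subcategory strings come from the PR.

```go
package main

import "fmt"

// PodIssueType is a hypothetical stand-in for the executor's pod issue enum.
type PodIssueType int

const (
	StuckTerminating PodIssueType = iota
	ExternallyDeleted
	ErrorDuringIssueHandling
	ActiveDeadlineExceeded
	StuckStartingUp
	UnableToSchedule
	FailedStartingUp
)

// CategoryInternal is the static top-level category.
const CategoryInternal = "internal"

// internalSubcategoryForPodIssueType returns the subcategory for the
// structural issue types. ok=false tells the caller to fall through to
// the operator categorizer. The signature is assumed for illustration.
func internalSubcategoryForPodIssueType(t PodIssueType) (subcategory string, ok bool) {
	switch t {
	case StuckTerminating:
		return "stuck-terminating", true
	case ExternallyDeleted:
		return "externally-deleted", true
	case ErrorDuringIssueHandling:
		return "issue-handler-error", true
	case ActiveDeadlineExceeded:
		return "active-deadline", true
	default:
		return "", false
	}
}

// categorize (hypothetical) checks the static mapping first, so the
// operator classifier is never consulted for structural issues.
func categorize(t PodIssueType, classify func() (string, string)) (string, string) {
	if sub, ok := internalSubcategoryForPodIssueType(t); ok {
		return CategoryInternal, sub
	}
	return classify()
}

func main() {
	operator := func() (string, string) { return "infra", "image-pull" }

	cat, sub := categorize(StuckTerminating, operator)
	fmt.Printf("%s / %s\n", cat, sub)

	cat, sub = categorize(StuckStartingUp, operator)
	fmt.Printf("%s / %s\n", cat, sub)
}
```

Because the static lookup runs before the classifier, an operator rule that happens to match a structural issue's message has no effect on the result.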

Not stamped internal:

  • Stuck-starting-up and unschedulable pod issues (image pull, scheduling) run through the operator categorizer, which attributes their external cause.
  • Lease returns and failed pod submissions wrap the kubelet or K8s API message. These paths do not run the categorizer yet, so they currently carry no category. Wiring the categorizer into them is follow-up work (see below).
  • Preemption is left uncategorized. It is a scheduling action rather than a failure, is metered separately, and does not flow through the failed-run path.
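
For contrast, the operator categorizer's "first matching rule, else configured default" behaviour might look something like the sketch below. The `rule` and `categorizer` types, the regex-based matching, and the example patterns are all hypothetical. The real rule format is not shown in this PR, and only the idea of a `defaultCategory` fallback is taken from the description.

```go
package main

import (
	"fmt"
	"regexp"
)

// rule is a hypothetical operator rule: a message pattern and its category.
type rule struct {
	pattern  *regexp.Regexp
	category string
}

// categorizer is a hypothetical stand-in for the operator categorizer.
type categorizer struct {
	rules           []rule
	defaultCategory string
}

// classify returns the category of the first matching rule, or the
// configured default when no rule matches.
func (c categorizer) classify(message string) string {
	for _, r := range c.rules {
		if r.pattern.MatchString(message) {
			return r.category
		}
	}
	return c.defaultCategory
}

// exampleCategorizer builds a categorizer with two illustrative rules.
func exampleCategorizer() categorizer {
	return categorizer{
		rules: []rule{
			{regexp.MustCompile(`ImagePullBackOff|ErrImagePull`), "user_error"},
			{regexp.MustCompile(`(?i)insufficient (cpu|memory)`), "infra"},
		},
		defaultCategory: "uncategorized",
	}
}

func main() {
	c := exampleCategorizer()
	fmt.Println(c.classify("Back-off pulling image: ImagePullBackOff"))
	fmt.Println(c.classify("0/3 nodes available: Insufficient cpu"))
	fmt.Println(c.classify("a message no rule anticipates"))
}
```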

The subcategory vocabulary lives in internal/common/errormatch (a dependency-free leaf already holding the sibling condition constants), so both the executor and the scheduler stamp from one shared set without an import cycle.
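
As an illustration of the shared vocabulary, the sketch below collapses the leaf package and its two consumers into one file. `CategoryInternal` is named in the PR. The `Subcategory*` constant names, the trimmed `Error` struct, and both constructor functions are assumptions for this example. The string values match the subcategories listed in this PR.

```go
package main

import "fmt"

// In the PR these constants live in internal/common/errormatch.
// The Subcategory* names here are hypothetical.
const (
	CategoryInternal        = "internal"
	SubcategoryLeaseExpired = "lease-expired"
	SubcategoryPodMissing   = "pod-missing"
)

// Error mirrors only the two existing proto fields the PR populates.
type Error struct {
	Message            string
	FailureCategory    string
	FailureSubcategory string
}

// newLeaseExpiredError stands in for a scheduler construction site.
func newLeaseExpiredError() Error {
	return Error{
		Message:            "lease expired",
		FailureCategory:    CategoryInternal,
		FailureSubcategory: SubcategoryLeaseExpired,
	}
}

// newPodMissingError stands in for an executor construction site.
func newPodMissingError() Error {
	return Error{
		Message:            "pod missing during reconciliation",
		FailureCategory:    CategoryInternal,
		FailureSubcategory: SubcategoryPodMissing,
	}
}

func main() {
	for _, e := range []Error{newLeaseExpiredError(), newPodMissingError()} {
		fmt.Printf("%s / %s: %s\n", e.FailureCategory, e.FailureSubcategory, e.Message)
	}
}
```

Both construction sites reference the same constants, so a renamed subcategory changes in one place.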

No proto change, no migration. The change populates the existing Error.FailureCategory/Error.FailureSubcategory proto fields. One observable metric effect: the executor failure counter armada_executor_job_failure_category_total now reports internal for the structural pod issues, which previously emitted no category (the counter is a no-op on an empty category). This is intentional.
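
The metric effect follows from the counter ignoring empty categories. A minimal sketch of that behaviour, using a plain map as a hypothetical stand-in for the real metric (the `failureCounter` type and its methods are not from the PR):

```go
package main

import "fmt"

// failureCounter is a hypothetical in-memory stand-in for
// armada_executor_job_failure_category_total.
type failureCounter struct {
	counts map[string]int
}

func newFailureCounter() *failureCounter {
	return &failureCounter{counts: make(map[string]int)}
}

// record increments the series for category and does nothing when the
// category is empty, so uncategorized failures emit no series.
func (c *failureCounter) record(category string) {
	if category == "" {
		return
	}
	c.counts[category]++
}

func main() {
	c := newFailureCounter()
	c.record("")         // before this PR: structural pod issue, no category
	c.record("internal") // after this PR: the same issue, now stamped
	fmt.Println(len(c.counts), c.counts["internal"])
}
```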

Where the stamps are visible today:

  • Lease expiry, failed job creation, reconciliation pod-missing, and the structural pod issues are JobRunErrors and reach the Lookout job_run.failure_category / failure_subcategory columns.
  • Max-runs-exceeded and job-rejected are JobErrors (not JobRunErrors), so they do not reach the Lookout job_run columns. The api event conversion only copies these fields on the PodError arm, so they do not reach the api stream either. Their stamps are not observable anywhere today. They are set for construction-site consistency and to be ready when JobErrors persistence or the api conversion is extended.
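
The visibility gap for JobErrors can be pictured as a conversion that copies the category fields on only one arm of a type switch. All types and the `toAPIEvent` function below are hypothetical simplifications, not Armada's actual event conversion.

```go
package main

import "fmt"

// reason is a hypothetical sum type over error reasons.
type reason interface{ isReason() }

type PodError struct{ Message string }
type MaxRunsExceeded struct{ Message string }

func (PodError) isReason()        {}
func (MaxRunsExceeded) isReason() {}

// InternalError carries the stamped fields alongside the reason.
type InternalError struct {
	FailureCategory    string
	FailureSubcategory string
	Reason             reason
}

// APIEvent is a hypothetical simplified api-stream event.
type APIEvent struct {
	Message     string
	Category    string
	Subcategory string
}

// toAPIEvent copies the category fields only on the PodError arm,
// mirroring the limitation described above.
func toAPIEvent(e InternalError) APIEvent {
	switch r := e.Reason.(type) {
	case PodError:
		return APIEvent{Message: r.Message, Category: e.FailureCategory, Subcategory: e.FailureSubcategory}
	case MaxRunsExceeded:
		return APIEvent{Message: r.Message} // stamp is set upstream but dropped here
	default:
		return APIEvent{}
	}
}

func main() {
	pod := InternalError{"internal", "pod-missing", PodError{"pod missing"}}
	maxRuns := InternalError{"internal", "max-runs-exceeded", MaxRunsExceeded{"max runs exceeded"}}
	fmt.Printf("%+v\n", toAPIEvent(pod))
	fmt.Printf("%+v\n", toAPIEvent(maxRuns))
}
```

Extending the other arms to copy the same two fields would be enough to surface the existing stamps in this sketch.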

Follow-up (separate PR): an onPodIssue structured matcher so operators can categorize the external Armada-detected causes (stuck-starting-up, unschedulable, and optionally the structural ones) into their own categories without regexing Armada's messages, plus running the categorizer on the lease-return and submission paths.

greptile-apps Bot commented Jun 19, 2026

Contributor

Greptile Summary

This PR stamps Armada-generated job-run failures with a static failure_category of "internal" and a matching failure_subcategory, filling a gap where only operator-classified pod failures previously carried any category. The boundary is clear: Armada-authored fixed messages get internal, while errors relaying external content (kubelet, k8s API, scheduler) continue through the operator categorizer or remain uncategorized pending follow-up work.

  • New constants (CategoryInternal + nine subcategory values) are added to the dependency-free errormatch package, giving both executor and scheduler a single shared vocabulary with no import cycle.
  • Executor changes: CreateMinimalJobFailedEvent gains failureCategory/failureSubcategory params (both call sites updated); handleNonRetryableJobIssue routes structural pod issue types through a new internalSubcategoryForPodIssueType helper, bypassing the operator classifier entirely for those types; StuckStartingUp, UnableToSchedule, and FailedStartingUp fall through to the classifier as before.
  • Scheduler changes: leaseExpiredError, the MaxRunsExceeded run error, and the JobRejected error are each stamped at their construction sites; the PR notes that MaxRunsExceeded and JobRejected are JobErrors not yet surfaced by Lookout, so their stamps are for construction-site consistency only.

Confidence Score: 5/5

Safe to merge; the change is purely additive, populating existing proto fields that were previously left empty, with no schema or migration required.

All seven error-construction sites are correctly updated, the classifier bypass is gated on a well-tested helper function, and both executor and scheduler test suites have new assertions verifying the stamps end-to-end. The only previously uncovered gap (the non-exhaustive switch) was already flagged in an earlier review thread.

No files require special attention; the logic in pod_issue_handler.go is the most complex change but is well-covered by the renamed structural-issue test and the per-type unit test for internalSubcategoryForPodIssueType.

Important Files Changed

| Filename | Overview |
| --- | --- |
| internal/common/errormatch/types.go | Adds CategoryInternal constant and nine subcategory constants to the shared errormatch package; clean addition with no logic changes. |
| internal/executor/reporter/event.go | Extends CreateMinimalJobFailedEvent signature with failureCategory/failureSubcategory params and wires them into the PodError; preemption/submit path still correctly leaves category empty. |
| internal/executor/service/pod_issue_handler.go | Adds internalSubcategoryForPodIssueType helper and wires it into handleNonRetryableJobIssue; structural issues bypass the classifier and get internal stamps; external/ambiguous types fall through to classifier as before. |
| internal/executor/service/pod_issue_handler_test.go | Renames and rewrites the classifier test to assert internal stamps override classifier on structural issues; removes the active-deadline classifier test case correctly since that path no longer goes through the classifier. |
| internal/scheduler/scheduler.go | Stamps LeaseExpired, MaxRunsExceeded, and JobRejected errors with the matching internal subcategories at three separate construction sites. |
| internal/scheduler/scheduler_test.go | Adds assertInternalFailureCategories helper hooked into TestScheduler_TestCycle to verify all three scheduler-stamped error types carry correct category/subcategory in published event sequences. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Pod issue detected] --> B{handleNonRetryableJobIssue}
    B --> C{internalSubcategoryForPodIssueType}
    C -->|StuckTerminating| D["internal / stuck-terminating"]
    C -->|ExternallyDeleted| E["internal / externally-deleted"]
    C -->|ErrorDuringIssueHandling| F["internal / issue-handler-error"]
    C -->|ActiveDeadlineExceeded| G["internal / active-deadline"]
    C -->|StuckStartingUp / UnableToSchedule / FailedStartingUp / default| H[Operator Classifier]
    H --> I["operator category / subcategory"]
    J[Scheduler events] --> K{Error type}
    K -->|LeaseExpired| L["internal / lease-expired"]
    K -->|MaxRunsExceeded| M["internal / max-runs-exceeded"]
    K -->|JobRejected| N["internal / job-rejected"]
    O[Executor: job creation failed] --> P["internal / job-creation-failed"]
    O2[Reconciliation: pod missing] --> P2["internal / pod-missing"]

Reviews (19): Last reviewed commit: "Merge branch 'master' into categorize-in..."

Comment thread internal/executor/service/pod_issue_handler.go Outdated
@dejanzele dejanzele force-pushed the categorize-internal-errors branch 4 times, most recently from 0af53eb to 4cbb8a5 Compare June 22, 2026 13:07
datadog-armadaproject Bot commented Jun 22, 2026


Pipelines

⚠️ Warnings

🚦 2 Pipeline jobs failed

CI | All jobs succeeded

CI | test / Golang Integration Tests

🔗 Commit SHA: 4cbb8a5

@dejanzele dejanzele force-pushed the categorize-internal-errors branch 14 times, most recently from c4ab17f to 4129b79 Compare June 25, 2026 10:43
JamesMurkin previously approved these changes Jun 25, 2026
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@dejanzele dejanzele force-pushed the categorize-internal-errors branch from 9d28e68 to 72c5d1d Compare June 25, 2026 11:54
@dejanzele dejanzele requested a review from JamesMurkin June 25, 2026 12:16
@dejanzele dejanzele enabled auto-merge (squash) June 25, 2026 12:24
@dejanzele dejanzele merged commit 85b582d into armadaproject:master Jun 25, 2026
17 checks passed
2 participants